Back to Article
Keep NA Roster Script
Download Notebook

Keep NA Roster Script

Author

Yunpeng Yu and Sarah Menz

Published

March 6, 2024

In [1]:
knitr::opts_chunk$set(
  eval = F,
  echo = T,
  message = F,
  warning = F
)

The keep_na script uses the keep_na.csv. The csv file is a running file of records that could not be matched to a WDRS event during initial processing.

The keep_na script has 3 main sections
The first section identifies any new records in the keep_na.csv. The new records are grouped by submitting lab and saved to the Communications_for_keep_na folder.
The second section attempts to match the records to WDRS events and roster data. If a record is matched to a WDRS event, it can be rostered. Records that cannot be matched stay in the keep_na.csv.
The third section identifies records that have been in the keep_na.csv for over 60 days. Records older than 60 days are removed from the keep_na.csv file and sent to the Delete folder.

Libraries and Paths

In [2]:
library(DBI)
library(odbc)
library(lubridate)
library(tidyverse)
library(readxl)
library(here)
library(fs)

New records for the communications folder

Previous keep_na

Read in the most recently archived keep_na.csv

In [3]:
# read in r_creds.RDS
r_creds <-readRDS(file.path(Sys.getenv("USERPROFILE"), "Projects/Sequencing/Data_Objects", "r_creds.RDS"))

# Get all filenames in the Archive
path <- file.info(list.files("keep_na/Archive/", full.names = T))

# Identify the file with the most recent date
most_recent_path <- rownames(path)[which.max(path$mtime)]
most_recent_file <- list.files(most_recent_path, pattern = "\\.csv$")


# Read in the most recently archived file
yesterday_keep_na <- read_csv(paste0(most_recent_path, "/", most_recent_file),
                        col_types = cols(.default = "c"),
                        na = c("", "NA", "N/A", "None", "NONE")
                              )

Copy keep_na file

The keep_na file is updated each time lab submissions are processed. This code chunk saves a copy of the keep_na.csv. This is important for identifying new records in the keep_na.csv and can also be used in troubleshooting.

In [4]:
# Create new folder in archive with today's date
newdir <- paste0("keep_na/Archive/keep_na_", format(today(), "%Y-%m-%d"))
dir.create(newdir)

# Save a copy of the current keep_na.csv before running the rest of the script
file.copy( 
  from = "keep_na/keep_na.csv" ,
    to = file.path(paste0((newdir), "/", "keep_na_", format(now(), "%Y-%m-%d-%H%M%S"), ".csv"))
  )

Current keep_na file

Read in the current keep_na.csv file

In [5]:
keep_na<- read_csv("keep_na/keep_na.csv",
                  col_types = cols(.default = "c"),
                  na = c("", "NA", "N/A", "None", "NONE"))

Identify new records

Compare today’s keep_na.csv with the previous version to identify new records. Records are saved in the template submitters format, with no demographic values. The records are split by submitting lab and saved to the communications folder.

In [6]:
# find records that are in today's keep_na and not in yesterday's keep_na

new_records <-  anti_join(keep_na, yesterday_keep_na, 
                          by = c("CASE_ID", "SEQUENCE_REASON", "SEQUENCE_LAB", "SEQUENCE_ACCESSION", 
                                 "SEQUENCE_CLINICAL_ACCESSION"))


# changing to template submitters format
new_records <- new_records %>% distinct() %>% 
  # renaming columns 
  select( LAB_ACCESSION_ID = SEQUENCE_CLINICAL_ACCESSION, 
          GISAID_ID = SEQUENCE_ACCESSION, 
          SPECIMEN_COLLECTION_DATE = SEQUENCE_SPECIMEN_COLLECTION_DATE, 
          SUBMITTING_LAB = SEQUENCE_LAB, 
          SEQUENCE_REASON, 
          SEQUENCE_STATUS,
          PANGO_LINEAGE = SEQUENCE_VARIANT_OPEN_TEXT) %>% 
  # if in keep_na, no demographics
  add_column( FIRST_NAME = NA, 
              LAST_NAME = NA, 
              MIDDLE_NAME = NA, 
              DOB = NA, 
              ALTERNATIVE_ID = NA ) %>% 
  # reordering columns
  relocate( LAB_ACCESSION_ID, 
            GISAID_ID, 
            SPECIMEN_COLLECTION_DATE, 
            SUBMITTING_LAB, 
            SEQUENCE_REASON, 
            SEQUENCE_STATUS, 
            PANGO_LINEAGE, 
            FIRST_NAME, 
            LAST_NAME, 
            MIDDLE_NAME, 
            DOB, 
            ALTERNATIVE_ID )

# save records if there are any
if(length(new_records > 0)){

# split records by lab and save to folder
split_new_records <- split(new_records, new_records$SUBMITTING_LAB)

# folder to save in with today's date
review_path <- dir_create(paste0("For_Review/Communications_for_keep_na/"), today())

# save records in for review folder, by sequence_lab
lapply(names(split_new_records), function(x){
  write_csv(split_new_records[[x]], 
            path = paste0(review_path, "/", names(split_new_records[x]), today(), ".csv"))
  })
}

Send Email - This is an DIQA internal notification of new records and files

In [7]:
# Email sender
email_from <- ""

# Email recipients
email_to <- ""

# Subject line
email_subj <-
  "Sequencing - Keep_NA Part 1 - New Records"

email_body <- paste(

  (nrow(new_records)), "new record(s) in keep_na.csv. \n ",(length(split_new_records)), "file(s) printed to For review/Communications_for_keep_na folder."
  
  )

sendmailR::sendmail(from = email_from,
                    to = email_to,
                    subject = email_subj,
                    msg = email_body,
                    headers= list("Reply-To" = email_from),
                    control = list(smtpServer = ""))

The remainder of script is only run on Wednesdays

In [8]:
# stop script if it isn't Wednesday
if(weekdays(today()) != "Wednesday") {
  stop("Do not run rest of script. Run the rest of the script on Wednesdays.")
}

Roster data

Set-up

WDRS Connection

IMPORTANT the variables used to connect to WDRS are held within conn_list.RDS. All .RDS objects in this repository except for VOC.RDS are excluded from Git commits by declaring *.RDS in the .gitignore file because they are often used to hold our “secrets” such as credential and server connections. We do not include server connections in code uploaded to GitHub. WHY? We have been asked by HTS to ensure our use of GitHub does not raise any security red flags. This server is an internal server containing confidential/restricted PHI. We want to hide this server information to reduce our possible “attack surface”. This connection may seem benign but it tells someone information they can use to “hack” into WDRS. SQL Server Native Client is now deprecated software and version 11 was the last release. Unsupported software is at higher risk of having security breaches. Additionally, someone would know the server name. So: DO NOT alter the code used to open the connection to WDRS in any way that creates a security risk. Continue to treat this connection as a secret and store its variables in a .RDS object (or other external object that is excluded from Git commits) rather than calling them directly here.

In [9]:
# connect
connection <- DBI::dbConnect(odbc::odbc(), 
                             Driver = r_creds$conn_list[1], 
                             Server = r_creds$conn_list[2], 
                             Database = r_creds$conn_list[3], 
                             Trusted_connection = r_creds$conn_list[4], 
                             ApplicationIntent = r_creds$conn_list[5])

WDRS queries

The DD_ELR_DD_ENTIRE query is used to match records and find a record’s case_id. The DD_GCD_COVID_19_FLATTENED query is used for QA checks.

In [10]:
# pull CASE_ID, SEQUENCE_CLINICAL_ACCESSION, COLLECTION_DATE and SPECIMEN__ID__ACCESSION__NUM__MANUAL from the [DD_ELR_DD_ENTIRE] table
wdrs_ent <- dbGetQuery(connection, "
  SELECT Distinct CASE_ID,
    [FILLER__ORDER__NUM] as SEQUENCE_CLINICAL_ACCESSION,
    SPECIMEN__COLLECTION__DTTM as COLLECTION_DATE,
    SPECIMEN__ID__ACCESSION__NUM__MANUAL
  FROM [dbo].[DD_ELR_DD_ENTIRE]
  WHERE CODE = 'SARS' AND STATUS != '6'
  ")

wdrs_ent$COLLECTION_DATE <- as.character(wdrs_ent$COLLECTION_DATE) 

# pull all SEQUENCE_ACCESSION from the [DD_GCD_COVID_19_FLATTENED] table. (these are specimens that have already been rostered)
wdrs_flat <- dbGetQuery(connection, "
  SELECT DISTINCT CDC_N_COV_2019_SEQUENCE_ACCESSION_NUMBER,
  CDC_N_COV_2019_SEQUENCE_CLINICAL_ACCESSION_NUMBER
  FROM [dbo].[DD_GCD_COVID_19_FLATTENED]
  ")

Create vectors of column names

These are referenced for joining tables and selecting column names.

  • all_cols: This includes roster columns, WDRS collection date, and date processed. It includes all columns required for the roster_filters function.
  • keepna_cols: All columns in the keep_na.csv (roster columns and date processed). If a script outputs additional columns to the keep_na, the csv file will need to be corrected. The extra columns for new records usually get skipped over when reading in the file and the columns need to be in the correct order. If there are strange values in columns, check if an upstream script has had recent changes.
  • roster_cols: Roster columns
In [11]:
all_cols <- c("CASE_ID",
"SEQUENCE_SGTF",
"SEQUENCE_SPECIMEN",
"SEQUENCE_REASON",
"SEQUENCE_DATE",
"SEQUENCE_LAB",
"SEQUENCE_STATUS",
"SEQUENCE_REPOSITORY",
"SEQUENCE_ACCESSION",
"SEQUENCE_VARIANT_OPEN_TEXT",
"SEQUENCE_CLINICAL_ACCESSION",
"SEQUENCE_SPECIMEN_COLLECTION_DATE",
"SEQUENCE_NOTES",
"SEQUENCE_REVIEWED",
"Case.Note",
"COLLECTION_DATE_WDRS",
"DATE_PROCESSED")

keepna_cols <- c("CASE_ID",
"SEQUENCE_SGTF",
"SEQUENCE_SPECIMEN",
"SEQUENCE_REASON",
"SEQUENCE_DATE",
"SEQUENCE_LAB",
"SEQUENCE_STATUS",
"SEQUENCE_REPOSITORY",
"SEQUENCE_ACCESSION",
"SEQUENCE_VARIANT_OPEN_TEXT",
"SEQUENCE_CLINICAL_ACCESSION",
"SEQUENCE_SPECIMEN_COLLECTION_DATE",
"SEQUENCE_NOTES",
"SEQUENCE_REVIEWED",
"Case.Note",
"DATE_PROCESSED")

roster_cols <- c("CASE_ID",
"SEQUENCE_SGTF",
"SEQUENCE_SPECIMEN",
"SEQUENCE_REASON",
"SEQUENCE_DATE",
"SEQUENCE_LAB",
"SEQUENCE_STATUS",
"SEQUENCE_REPOSITORY",
"SEQUENCE_ACCESSION",
"SEQUENCE_VARIANT_OPEN_TEXT",
"SEQUENCE_CLINICAL_ACCESSION",
"SEQUENCE_SPECIMEN_COLLECTION_DATE",
"SEQUENCE_NOTES",
"SEQUENCE_REVIEWED",
"Case.Note")

WDRS Sequence Accession values

The flattened table can include multiple values in a single row. The SEQUENCE_ACCESSION values are split at each “,” to create a single vector of values. The list of SA values is used in the roster_filters function to check if a record’s SA already exists in WDRS.

In [12]:
# for fields that have multiple comma separated SEQUENCE_ACCESSION's split them by ","
wdrs_sa_flat_split <- unlist(str_split(wdrs_flat[[1]], ","))

# omit any NA's
wdrs_sa_flat_clean <- wdrs_sa_flat_split[!is.na(wdrs_sa_flat_split)] %>%
#for fields that have "hCoV-19/" appended to the beginning of the SEQUENCE_ACCESSION remove it by str_replace() with ""
  str_replace("hCoV-19/", "") %>%
# trim off the white space resulting from str_split, this also gets rid of " " values  
  str_trim("both")

# remove any values that are ""
wdrs_sa_flat_values <- wdrs_sa_flat_clean[wdrs_sa_flat_clean != ""]

WDRS Sequence Clinical Accession values

The SEQUENCE_CLINICAL_ACCESSION values are split at each “,” to create a single vector of values. The list of SCA values is used in the roster_filters function to check if a record’s SCA already exists in WDRS.

In [13]:
# for fields that have multiple comma separated SEQUENCE_CLINICAL_ACCESSION's split them by ","
wdrs_sca_flat_split <- unlist(str_split(wdrs_flat[[2]], ","))

# omit any NA's
wdrs_sca_flat_clean <- wdrs_sca_flat_split[!is.na(wdrs_sca_flat_split)] %>%
#for fields that have "hCoV-19/" appended to the beginning of the SEQUENCE_CLINICAL_ACCESSION remove it by str_replace() with ""
  str_replace("hCoV-19/", "") %>%
# trim off the white space resulting from str_split, this also gets rid of " " values  
  str_trim("both")

# remove any values that are ""
wdrs_sca_flat_values <- wdrs_sca_flat_clean[wdrs_sca_flat_clean != ""]

Load WA GISAID data

WA GISAID file is used to update the lineage information for matched records.

In [14]:
# gisaid_metadata <- read_tsv(file.path(project_folder, "GISAID Data\\wa_gisaid_data.tsv"), 
#                             col_types = cols(.default = "c"),
#                            na = c("", "NA", "None", "NONE") 
#                             )

gisaid_metadata <- read_rds("GISAID Data\\wa_gisaid.rds") %>%
  as_tibble()

Load WA CDC Cumulative data

WA CDC Cumulative file is also used to update lineage information.

In [15]:
CDC_cumulative <- read_csv("Submissions/CDC_Cumulative/Washington_cumulative.csv", 
                            col_types = cols(.default = "c"),
                            na = c("", "NA", "None", "NONE")
                            )

Load sequence reasons, laboratories, and lineages

Both of these files are used in QA checks The lab_variables.rds file is created in Roster_scripts/write_lab_variables.R. It includes valid sequence reasons and lab names and is used across multiple scripts. Please refer to the tables on the GitHub wiki for updates.
The lineages.csv file contains all lineages, both active and withdrawn.

In [16]:
# bring in object with sequence reasons and sequence laboratories
lab_vars <- readRDS("Data_Objects/lab_variables.rds")

# bring in file with all lineages (active or withdrawn)
lineages <- read_csv("Data_Objects/Lineages/Lineages.csv", col_types = "c")

Quality filter functions

The quality filters script includes the roster_filters function for QA checks.

In [17]:
# Read in quality_filters
source(file.path("C:/Users/", Sys.getenv("USERNAME"), "Projects/Sequencing/Roster_scripts/quality_filters.R"))

Matching to WDRS

Remove duplicates, low quality, failed, or pending records

Duplicates and non-PHL records that are failed or low quality are dropped from the keep_na.

In [18]:
keep_na_keep <- keep_na %>% 
  # remove records SEQUENCE_STATUS as PENDING 
  filter(SEQUENCE_STATUS != "PENDING") %>% 
  # remove FAILED or LOW QUALITY records from non-PHL submitters
  filter(SEQUENCE_LAB == "PHL" | !SEQUENCE_STATUS %in% c("FAILED", "LOW QUALITY")) %>% 
  # remove duplicates
  distinct(across(all_of(roster_cols)), .keep_all = TRUE)

Create column for joining

To match records, the SEQUENCE_ACCESSION needs to be reformatted to the GISAID ID. This also checks the format for SEQUENCE_SPECIMEN_COLLECTION_DATE and SEQUENCE_LAB. If a record from a CDC lab has a blank SEQUENCE_REASON, it’s populated with SENTINEL SURVEILLANCE. When all processes are finalized, the SEQUENCE_REASON, SEQUENCE_LAB, and/or SEQUENCE_SPECIMEN_COLLECTION_DATE mutate statements could likely be removed (and any problems will be handled within the for review process.)

In [19]:
keep_na_mut <- keep_na_keep %>%
  # Reformat SEQUENCE_REASON and SEQUENCE_LAB
  mutate(SEQUENCE_REASON = case_when(
                    is.na(SEQUENCE_REASON) & SEQUENCE_LAB %in% lab_vars$cdc_labs ~ 
                      "SENTINEL SURVEILLANCE",
                    TRUE ~ 
                      str_to_upper(SEQUENCE_REASON))) %>% 
  mutate(SEQUENCE_LAB = case_when(
                    str_detect(toupper(SEQUENCE_LAB), "AEGIS") ~ "Aegis",
                    str_detect(toupper(SEQUENCE_LAB), "OREGON") ~ "OHSU",
                    str_detect(toupper(SEQUENCE_LAB), "SCAN/BEDFORD") ~ "NW Genomics",
                    TRUE ~ 
                      SEQUENCE_LAB)) %>%
  # Format dates across wdrs_ent, gisaid_metadata and CDC ("08/05/2021")
  mutate(SEQUENCE_SPECIMEN_COLLECTION_DATE =
               case_when(
                 !str_detect(SEQUENCE_SPECIMEN_COLLECTION_DATE, "^[[:digit:]]{5}$") ~
                   as.character(format(parse_date_time(SEQUENCE_SPECIMEN_COLLECTION_DATE,  
                                                       c("mdy", "ymd")), "%m/%d/%Y")),
                 str_detect(SEQUENCE_SPECIMEN_COLLECTION_DATE, "^[[:digit:]]{5}$") ~
                   as.character(openxlsx::convertToDate(SEQUENCE_SPECIMEN_COLLECTION_DATE)
                      ))) %>% 
  # Create a new column GISAID_ID to match against GISAID and CDC
  mutate(GISAID_ID = case_when(
                    str_detect(toupper(SEQUENCE_ACCESSION), "^USA") ~ 
                        paste0("hCoV-19/", SEQUENCE_ACCESSION),
                    TRUE ~ 
                      SEQUENCE_ACCESSION))


# Check if anything is still in Excel date format

if (any(na.omit(str_detect(keep_na_mut$SEQUENCE_SPECIMEN_COLLECTION_DATE, "^[[:digit:]]{5}$")))){
  stop(paste0("Please review date column. There is a date in Excel format"))
}

# With the mutate statement for SEQUENCE_SPECIMEN_COLLECTION_DATE, there will be a warning about NAs introduced by coercion. These were probably NAs before the mutate statement and don't need any further investigation. If there are more NAs in the date column after the mutate statement, check if there's a new date format that's being used and not being captured in the format options.

if (any(keep_na_keep %>% count(is.na(SEQUENCE_SPECIMEN_COLLECTION_DATE)) !=  keep_na_mut %>%
        count(is.na(SEQUENCE_SPECIMEN_COLLECTION_DATE)))){
  stop(paste0("Please review date column. NAs have been introduced after date parsing. Review keep_na_keep for new date formats."))
}

Join with WDRS_ENTIRE table to get CASE_ID

Match the two tables using SEQUENCE_CLINCIAL_ACCESSION and SEQUENCE_SPECIMEN_COLLECTION_DATE. If a record is matched and the submitted collection date is within 14 days of the WDRS colleciton date, the case_id is added to the record. Records that already have a CASE_ID are not updated.

In [20]:
# Join keep_na with wdrs entire table
keep_na_ent <- keep_na_mut %>%
  # Get CASE_ID by matching with SEQUENCE_CLINICAL_ACCESSION and 14 day collection day window
  left_join(wdrs_ent %>% 
              rename(CASE_ID_SCA=CASE_ID, COLLECTION_DATE_WDRS = COLLECTION_DATE), 
            by = c("SEQUENCE_CLINICAL_ACCESSION"), na_matches="never") %>%
  distinct()

# Check for additional errors
keep_na_review_errors <-  
  roster_filters(keep_na_ent, lab_vars, wdrs_sa_flat_values, wdrs_sca_flat_values, lineages$lineage_extracted)

  
# If the difference in collection date is less than 14 days, the case_id can be populated from WDRS
# This filters out the records that have a SEQUENCE_CLINICAL_ACCESSION match in WDRS but the collection date range is more than 14 days
  keep_na_case <- keep_na_review_errors %>% 
    mutate(CASE_ID = case_when(
     # If case_id is missing and a case_id is found from wdrs with a collection date within a 14 day window, use the case_id from wdrs
        is.na(CASE_ID)  & !is.na(CASE_ID_SCA) & is.na(QA_COLLECT_DATE) ~ CASE_ID_SCA, 
        TRUE ~ CASE_ID))

Update SEQUENCE_VARIANT_OPEN_TEXT

Match the keep_na records to the GISAID file and the CDC file and format the collection date variable. If the record is matched, update the SEQUENCE_OPEN_TEXT_VARIANT, SEQUENCE_STATUS, and SEQUENCE_SPECIMEN_COLLECTION_DATE. The SEQUENCE_OPEN_TEXT_VARIANT should be populated using the variant in the GISAID or CDC files over what is already in SEQUENCE_OPEN_TEXT_VARIANT column.

In [21]:
keep_na_var <- keep_na_case %>%
  # Join columns from gisaid_metadata, match on "GISAID_ID" = "Virus name" to join lineage column
  left_join(gisaid_metadata[c("Virus name", "Lineage", "Collection date")], 
            by = c("GISAID_ID" = "Virus name")) %>%
  # Join columns from CDC cumulative, match on "GISAID_ID" = "GISAID_name" to join lineage_PANGO_lineage column
  left_join(CDC_cumulative[c("GISAID_name", "lineage_PANGO_lineage", "collection_date")], 
            by = c("GISAID_ID" = "GISAID_name")) %>%
  rename("COLLECTION_DATE_GISAID" = "Collection date", 
         "COLLECTION_DATE_CDC" = "collection_date") %>%
  # Formatted date to be in m/d/y format and check for Excel date format
  mutate(COLLECTION_DATE_GISAID =
          case_when(
                 !str_detect(COLLECTION_DATE_GISAID, "^[[:digit:]]{5}$") ~
                   as.character(format(parse_date_time(COLLECTION_DATE_GISAID,  
                                                       c("mdy", "ymd")), "%m/%d/%Y")),
                  str_detect(COLLECTION_DATE_GISAID, "^[[:digit:]]{5}$") ~
                    as.character(openxlsx::convertToDate(COLLECTION_DATE_GISAID)
                    )),
        COLLECTION_DATE_CDC =
           case_when(
                 !str_detect(COLLECTION_DATE_CDC, "^[[:digit:]]{5}$") ~ 
                   as.character(format(parse_date_time(COLLECTION_DATE_CDC,  
                                                       c("mdy", "ymd")), "%m/%d/%Y")),
                  str_detect(COLLECTION_DATE_CDC, "^[[:digit:]]{5}$") ~ 
                    as.character(openxlsx::convertToDate(COLLECTION_DATE_CDC)
                    )) 
                  )


keep_na_var <- keep_na_var %>%  
  mutate(SEQUENCE_VARIANT_OPEN_TEXT = 
           # Combine lineage information to one column
           case_when(
             # Update SEQUENCE_VARIANT_OPEN_TEXT column using lineage_PANGO_lineage and Lineage
             # If a CDC_lab, use lineage_PANGO_lineage, otherwise use Lineage
              !is.na(lineage_PANGO_lineage) & lineage_PANGO_lineage != "None" & 
                SEQUENCE_LAB %in% lab_vars$cdc_labs ~
                      lineage_PANGO_lineage,
              !is.na(Lineage)  & Lineage != "None" & 
                !(SEQUENCE_LAB %in% lab_vars$cdc_labs)  ~ 
                      Lineage,
              !is.na(lineage_PANGO_lineage)  & lineage_PANGO_lineage != "None" ~ 
                      lineage_PANGO_lineage,
              !is.na(Lineage)  & Lineage != "None" ~ 
                      Lineage,
                     TRUE ~ 
                NA_character_
                     )) %>% 
  
  # SEQUENCE_VARIANT_OPEN_TEXT needs to be a valid lineage
  mutate(SEQUENCE_VARIANT_OPEN_TEXT = 
           case_when(SEQUENCE_VARIANT_OPEN_TEXT %in% lineages$lineage_extracted ~ 
                       SEQUENCE_VARIANT_OPEN_TEXT,
           TRUE ~ NA_character_)) %>% 
  
    # Update SEQUENCE_STATUS to COMPLETE if Lineage is present
    mutate(SEQUENCE_STATUS = 
             case_when(
              is.na(SEQUENCE_STATUS) & !is.na(SEQUENCE_VARIANT_OPEN_TEXT) ~ 
                "COMPLETE",
              TRUE ~ SEQUENCE_STATUS)) %>%
  
  # Update SEQUENCE_SPECIMEN_COLLECTION_DATE
  mutate(SEQUENCE_SPECIMEN_COLLECTION_DATE = 
    case_when(
      
    # If date is missing and observation is from a CDC lab, use collection date from CDC cumulative 
    is.na(SEQUENCE_SPECIMEN_COLLECTION_DATE) & !is.na(COLLECTION_DATE_CDC) & 
      SEQUENCE_LAB %in% lab_vars$cdc_labs ~ 
      COLLECTION_DATE_CDC,
    
    #  If date is missing and observation is from a CDC lab, use collection date from gisaid
    is.na(SEQUENCE_SPECIMEN_COLLECTION_DATE) & !is.na(COLLECTION_DATE_GISAID) & 
      !(SEQUENCE_LAB %in% lab_vars$cdc_labs) ~ 
      COLLECTION_DATE_GISAID,
    
    # If date is missing, use collection date from CDC cumulative 
    is.na(SEQUENCE_SPECIMEN_COLLECTION_DATE) & !is.na(COLLECTION_DATE_CDC) ~ 
      COLLECTION_DATE_CDC,
    
    # If date is missing, use collection date from gisaid
    is.na(SEQUENCE_SPECIMEN_COLLECTION_DATE) & !is.na(COLLECTION_DATE_GISAID)  ~ 
      COLLECTION_DATE_GISAID,
    
         TRUE ~ SEQUENCE_SPECIMEN_COLLECTION_DATE)) %>% 
    distinct()

Create preliminary roster

All roster variables are updated here. SEQUENCE_NOTES is updated to match new SEQUENCE_VARIANT_OPEN_TEXT

In [22]:
# Should use the most up-to-date lineage from the CDC_cumulative or wa_gisaid files
# The updated lineage should be used in SEQUENCE_VARIANT_OPEN_TEXT and SEQUENCE_NOTES
# If SEQUENCE_STATUS is COMPLETE, NOTES need be present to be rostered
# Some NOTES show lineage as none or NA. These should be replaced from one of the lineage columns (Lineage, lineage_PANGO_lineage, or SEQUENCE_VARIANT_OPEN_TEXT) or changed to blank if those are missing


keep_na_roster_check <- keep_na_var %>%
  mutate(SEQUENCE_SGTF = "") %>%
  mutate(SEQUENCE_SPECIMEN = "YES") %>%
  mutate(SEQUENCE_DATE = "") %>% 
  mutate(SEQUENCE_REPOSITORY = "GISAID") %>% 
  mutate(SEQUENCE_REVIEWED = "") %>%
  mutate(Case.Note = "External data question package updated by COVID19 Sequencing Roster.") %>% 
  mutate(SEQUENCE_ACCESSION=str_replace(SEQUENCE_ACCESSION,"hCoV-19/", "")) %>% 
    mutate(SEQUENCE_NOTES = 
             case_when(
    # If notes missing, fill in with lineage
    is.na(SEQUENCE_NOTES) & !is.na(SEQUENCE_VARIANT_OPEN_TEXT) ~ 
         paste0("Lineage identified as ", SEQUENCE_VARIANT_OPEN_TEXT, " on ",
            today(), ". Lineage assignments may change over time."),
    
   # If notes say lineage is none or NA, update with lineage
   (str_detect(toupper(SEQUENCE_NOTES), "LINEAGE IDENTIFIED AS NONE") | 
      str_detect(toupper(SEQUENCE_NOTES), "LINEAGE IDENTIFIED AS NA")) & 
     !is.na(SEQUENCE_VARIANT_OPEN_TEXT)   ~
        paste0("Lineage identified as ", SEQUENCE_VARIANT_OPEN_TEXT, " on ", 
           today(), ". Lineage assignments may change over time."),
   
   # If notes exist by lineage needs to be updated
    !str_detect (SEQUENCE_NOTES, SEQUENCE_VARIANT_OPEN_TEXT)   & 
     !is.na(SEQUENCE_VARIANT_OPEN_TEXT)  ~  
        paste0("Lineage identified as ", SEQUENCE_VARIANT_OPEN_TEXT, " on ", 
             today(), ". Lineage assignments may change over time."),
   
   # If notes say lineage was identified as none or na and there isn't a lineage, change to blank
   (str_detect(toupper(SEQUENCE_NOTES), "LINEAGE IDENTIFIED AS NONE") | 
      str_detect(toupper(SEQUENCE_NOTES), "LINEAGE IDENTIFIED AS NA")) &  
     (is.na(SEQUENCE_VARIANT_OPEN_TEXT)) ~ 
        "",
   
   # Change NA to blank
    is.na(SEQUENCE_NOTES) ~ 
        "",
    TRUE ~ SEQUENCE_NOTES)) 



keep_na_roster_1 <- keep_na_roster_check %>%
  select(all_of(all_cols))

Quality filters

Run quality checks on records. The majority of records usually stay in the keep_na.csv, so don’t be alarmed by a high number of flags.

In [23]:
keep_na_qa <- roster_filters(keep_na_roster_1, lab_vars, wdrs_sa_flat_values, wdrs_sca_flat_values, lineages$lineage_extracted)

Final outputs

Save records that are ready to be rostered, that have non-keepna quality issues, or that need to stay in the keepna.csv

In [24]:
# Output roster in general for all labs

keep_na_roster_final <- keep_na_qa %>% 
                          filter(sum == 0) %>%  
                          select(all_of(roster_cols))


if(nrow(keep_na_roster_final) > 0) {
  keep_na_roster_final %>%
    write_csv(paste0(file.path("write_roster_here/keep_naNewSeq_"), format(now(), "%Y-%m-%d-%H%M%S"), ".csv"), na = "")
}


# Output rows with error for review
# Rows missing case_id, missing sequence clinical accession, or missing sequence accession shouldn't be sent to for review folder (stay in keep_na file for 60 days before deleting)

keep_na_review <- keep_na_qa %>% 
  filter(is.na(QA_CASE_ID) & 
           is.na(QA_SCA_NA) &
           is.na(QA_SEQ_STAT)) 


if(nrow(keep_na_review) > 0) {
  keep_na_review %>%
    select(-DATE_PROCESSED, -sum) %>% 
  write_csv(paste0("For_Review/to_process/keep_na_review", format(now(), "%Y-%m-%d-%H%M%S"), ".csv"), na = "")
  }


# Outputs rows that don't meet criteria to the keep_na_final file
keep_na_final <- keep_na_roster_1 %>% 
  anti_join(keep_na_roster_final, by=roster_cols) %>% 
  anti_join(keep_na_review, by=roster_cols) %>% 
  distinct() %>% 
  select(all_of(keepna_cols))

# Save the keep_na csv file here if it isn't the first week of the month
if(day(today()) >= 8) {
# Output keep_na_final as running keep_na
if(nrow(keep_na_final) > 0) {
 keep_na_final %>%
   write_csv(paste0("keep_na/keep_na.csv"), na = "")
 }
}

Save files

These need to be used to replicate the results from a specific day’s run (copy of keep_na from that day is also saved here)

In [25]:
saveRDS(CDC_cumulative, (paste0(file.path(newdir), "/", "wa_cdc", format(now(), "%Y-%m-%d-%H%M%S"), ".RDS")))
saveRDS(gisaid_metadata, (paste0(file.path(newdir), "/", "gisaid_wa", format(now(), "%Y-%m-%d-%H%M%S"), ".RDS")))

Send email - DIQA internal notification of keep_na records sent to write_roster_here or for_review

In [26]:
# Subject line
email_subj2 <-
  "Sequencing - Keep_NA Part 2 - Match keep_na to WDRS"

email_body2 <- paste(
  
  (nrow(keep_na_roster_final)), "keep_na record(s) matched to WDRS and sent to writer_roster_here.\n",(nrow(keep_na_review)), "keep_na records matched to WDRS and sent to for_review."
  
  )

sendmailR::sendmail(from = email_from,
                    to = email_to,
                    subject = email_subj2,
                    msg = email_body2,
                    headers= list("Reply-To" = email_from),
                    control = list(smtpServer = ""))

Remove old records

Identify records that were sent to the keep_na.csv over 60 days ago. A column is added with notes about why the record couldn’t be rostered. The notes include details about:

  • if SCA is missing or not in WDRS
  • if SA is missing, not in GISAID, or not in CDC
  • if CASE_ID is missing
In [27]:
# DATE_PROCESSED variable in different formats
keep_na_final <- keep_na_final %>% mutate(DATE_PROCESSED = 
                                            as.character(format(parse_date_time(DATE_PROCESSED, c("mdy", "ymd")), "%m/%d/%Y")))

# Currently, we only need to run this once a month (if this needs to be run more often, modify the if statement)
if(day(today()) < 8) {
  
# Make a table of samples that are less than 60 days old
  keep_na_save <- keep_na_final %>% 
    mutate(DATE_PROCESSED = 
             as.Date(DATE_PROCESSED, "%m/%d/%Y")) %>% 
    filter(DATE_PROCESSED >= today()-days(60)) %>% 
    mutate(DATE_PROCESSED = 
             as.character(format(parse_date_time(DATE_PROCESSED, c("mdy", "ymd")), "%m/%d/%Y"))) %>%
    select(all_of(keepna_cols))
  
# Finding observations that are > 60 days old and adding notes about why can't be rostered for further investigation
  keep_na_delete <- keep_na_final %>%
     # Remove any samples that are less than 60 days old
    anti_join(keep_na_save, by=roster_cols) %>%
    mutate(
      
      # SEQUENCE_CLINICAL_ACCESSION is not in wdrs_ent or is missing
      condition_1 = 
        case_when(is.na(SEQUENCE_CLINICAL_ACCESSION)  ~ 
                         "SCA missing",
                  (!(SEQUENCE_CLINICAL_ACCESSION %in% wdrs_ent$SEQUENCE_CLINICAL_ACCESSION)|
                   !(SEQUENCE_CLINICAL_ACCESSION %in% wdrs_ent$SPECIMEN__ID__ACCESSION__NUM__MANUAL))  ~ 
                        "SCA not in WDRS"),
      
      # SEQUENCE_ACCESSION missing, not in GISAID, not in CDC cumulative
      condition_2 = 
        case_when(is.na(SEQUENCE_ACCESSION) ~ 
                      "SA missing",
                  
      # SEQUENCE_ACCESSION is not in GISAID for non-CDC labs
                  !(SEQUENCE_LAB %in% lab_vars$cdc_labs) & 
                  !(paste0("hCoV-19/", SEQUENCE_ACCESSION) %in% gisaid_metadata$`Virus name`)  ~ 
                      "SA not in GISAID",
      
      # SEQUENCE_ACCESSION is not in CDC cumulative for CDC labs
                    (SEQUENCE_LAB %in% lab_vars$cdc_labs) & 
                    !(paste0("hCoV-19/", SEQUENCE_ACCESSION) %in% CDC_cumulative$GISAID_name)  ~ 
                      "SA not in CDC cumulative"),
      
      # CASE_ID is missing
      condition_3 = case_when(is.na(CASE_ID) ~ "CASE_ID missing")
           ) %>%
    
  # combine notes into one column
  unite("NOTES", condition_1:condition_3, remove = TRUE, sep = " / ", na.rm = TRUE) %>% 
    arrange(SEQUENCE_LAB)
  
  if(nrow(keep_na_save) > 0) {
        keep_na_save %>%
          write_csv("keep_na/keep_na.csv", na = "")
  }

  if(nrow(keep_na_delete) > 0) {
        keep_na_delete %>%
          write_csv(paste0("keep_na/Delete/keep_na_delete_", format(now(), "%Y-%m-%d"), ".csv"), na = "")
  }
}

Send email - DIQA internal notification of monthly keep_na_delete file

In [28]:
# Subject line
email_subj3 <-
  "Sequencing - Keep_NA Part 3 - Remove Old Records"

email_body3 <- paste(
  
  (nrow(keep_na_delete)), "record(s) written to keep_na_delete."
  
  )

sendmailR::sendmail(from = email_from,
                    to = email_to,
                    subject = email_subj3,
                    msg = email_body3,
                    headers= list("Reply-To" = email_from),
                    control = list(smtpServer = ""))